A Decision Tree-based Missing Value Imputation Technique for Data Pre-processing

Authors

Md. Geaur Rahman

Md Zahidul Islam

Abstract

Data pre-processing plays a vital role in data mining by ensuring good data quality. In general, data pre-processing tasks include imputation of missing values, identification of outliers, smoothing of noisy data and correction of inconsistent data. In this paper, we present an efficient missing value imputation technique called DMI, which makes use of a decision tree and the expectation maximization (EM) algorithm. We argue that the correlations among attributes within a horizontal partition of a data set can be higher than the correlations over the whole data set. For some existing algorithms, such as EM-based imputation (EMI), the accuracy of imputation is expected to be better for a data set with higher correlations than for a data set with lower correlations. Therefore, our technique (DMI) applies EMI to various horizontal segments of a data set where correlations among attributes are high. We evaluate DMI on two publicly available natural data sets by comparing its performance with that of EMI. We use various patterns of missing values, each with different missing ratios of up to 10%. Several evaluation criteria are used, such as the coefficient of determination (R²), the index of agreement (d₂) and the root mean squared error (RMSE). Our initial experimental results indicate that DMI performs significantly better than EMI.
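
The idea of imputing within horizontal segments rather than over the whole data set can be illustrated with a small sketch. This is not the authors' implementation, and every name in it (`em_style_impute`, `best_split`, `dmi_sketch`) is hypothetical. Two simplifications are assumed: a one-level regression stump stands in for the decision tree, and an iterated conditional-mean imputation under a Gaussian model stands in for EMI (it omits the conditional-covariance correction of a full EM step). The synthetic data are built so that the two attributes are more strongly correlated within each segment than over the whole set.

```python
import numpy as np

def em_style_impute(X, n_iter=50):
    """Iterated conditional-mean imputation under a Gaussian model.
    Assumption: a simplified stand-in for EMI, not full EM."""
    X = X.copy()
    miss = np.isnan(X)
    # Start from column means of the observed values.
    X[miss] = np.take(np.nanmean(X, axis=0), np.where(miss)[1])
    for _ in range(n_iter):
        mu = X.mean(axis=0)
        cov = np.cov(X, rowvar=False) + 1e-6 * np.eye(X.shape[1])
        for i in np.where(miss.any(axis=1))[0]:
            m, o = miss[i], ~miss[i]
            if not o.any():          # nothing observed: fall back to the mean
                X[i, m] = mu[m]
                continue
            # E[x_m | x_o] = mu_m + S_mo S_oo^{-1} (x_o - mu_o)
            coef = np.linalg.solve(cov[np.ix_(o, o)], X[i, o] - mu[o])
            X[i, m] = mu[m] + cov[np.ix_(m, o)] @ coef
    return X

def best_split(X_complete, target, feature):
    """Depth-1 regression tree: threshold on `feature` that minimises the
    within-segment variance of `target` (stand-in for a decision tree)."""
    best_t, best_cost = None, np.inf
    values = np.unique(X_complete[:, feature])
    for t in (values[:-1] + values[1:]) / 2:
        left = X_complete[X_complete[:, feature] <= t, target]
        right = X_complete[X_complete[:, feature] > t, target]
        cost = left.size * left.var() + right.size * right.var()
        if cost < best_cost:
            best_t, best_cost = t, cost
    return best_t

def dmi_sketch(X, target, feature):
    """Split rows into two horizontal segments, then impute each separately.
    Assumes `feature` is fully observed."""
    complete = X[~np.isnan(X).any(axis=1)]   # learn the split on complete rows
    t = best_split(complete, target, feature)
    out = X.copy()
    left = X[:, feature] <= t
    for seg in (left, ~left):
        out[seg] = em_style_impute(X[seg])
    return out

def rmse(a, b):
    return float(np.sqrt(np.mean((a - b) ** 2)))

# Synthetic data: x1 depends on x0 differently in each half of the range.
rng = np.random.default_rng(0)
x0 = rng.uniform(0, 10, 400)
x1 = np.where(x0 <= 5, 2 * x0, 40 - 2 * x0) + rng.normal(0, 0.5, 400)
truth = np.column_stack([x0, x1])
data = truth.copy()
hidden = rng.random(400) < 0.10              # roughly 10% missing in x1
data[hidden, 1] = np.nan

completed = dmi_sketch(data, target=1, feature=0)
rmse_dmi = rmse(completed[hidden, 1], truth[hidden, 1])
rmse_global = rmse(em_style_impute(data)[hidden, 1], truth[hidden, 1])
print(f"segment-wise RMSE: {rmse_dmi:.3f}, whole-data RMSE: {rmse_global:.3f}")
```

On data like this, the segment-wise imputation should give a lower RMSE than imputation over the whole data set, because a single linear relationship fits each segment well but fits the full set poorly. Real data sets need a deeper tree, handling of categorical attributes, and safeguards for very small segments, none of which this sketch attempts.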

Similar resources

Machine Learning Based Missing Value Imputation Method for Clinical Dataset

Missing value imputation is one of the biggest tasks of data pre-processing when performing data mining. Most medical datasets are usually incomplete. Simply removing the cases from the original datasets can bring more problems than solutions. A suitable method for missing value imputation can help to produce good quality datasets for better analysing clinical trials. In this paper we explore t...

An Ensemble approach on Missing Value Handling in Hepatitis Disease Dataset

The major work in data pre-processing is handling missing value imputation in hepatitis disease diagnosis, which is one of the primary stages in data mining. Many health datasets are typically imperfect. Just removing the cases from the original datasets can create more problems than solutions. An appropriate technique for missing value imputation can assist to generate high-quality datasets fo...

Fuzzy Unordered Rules Induction Algorithm Used as Missing Value Imputation Methods for K-Mean Clustering on Real Cardiovascular Data

Using Classifier-Based Nominal Imputation to Improve Machine Learning

Many learning algorithms perform poorly when the training data are incomplete. One standard approach involves first imputing the missing values, then giving the completed data to the learning algorithm. However, this is especially problematic when the features are nominal. This work presents “classifier-based nominal imputation” (CNI), an easy-to-implement and effective nominal imputation techn...

Data Quality Improvement by Imputation of Missing Values

Having missing values in a data set is very common due to various reasons including human error, misunderstanding and equipment malfunctioning. Therefore, imputation of missing values is important to improve the quality of a data set. In our previous study we presented an imputation technique called DMI, which we then found better than an existing technique called EMI in terms of a few commonly...

Publication date: 2011
